Chris Pollett > Old Classses > CS267
( Print View )

Student Corner:
  [Submit Sec1]
  [Grades Sec1]

  [
Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Description]
  [Course Outcomes]
  [Outcomes Matrix]
  [Course Schedule]
  [Grading]
  [Requirements/HW/Quizzes]
  [Class Protocols]
  [Exam Info]
  [Regrades]
  [University Policies]
  [Announcements]

HW Assignments:
  [Hw1]  [Hw2]  [Hw3]
  [Hw4]  [Hw5]  [Quizzes]

Practice Exams:
  [Mid1]  [Mid2]   [Final]

                           












HW#2 --- last modified January 01 1970 00:00:00.

Solution set.

Due date:

Files to be submitted:
  Hw2.zip

Purpose:

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO1 (Course Learning Outcome 1) -- Code a basic inverted index capable of performing conjunctive queries.

CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of two parts a coding part and an experiment part. Submit all code and documents in the file Hw2.zip that you submit. For the coding part, I want you to write a program which is capable of computing the results of conjunctive queries on a corpus, ranked by one of the ranking mechanisms we are learning. Your program will be run from the command line with a line like:
aux_program SearchIndex some_file rank_method num_results query

Here aux_program is at most one of java, php, python (your choice). For php or python, IndexPrinter should be either IndexPrinter.php or IndexPrinter.py, respectively If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS High Sierra, having used brew to install the compiler. filename is the name of some text document. This document is assumed to contain only ASCII characters with '\n' used for line endings. For this assignment, the sequence '\n\n' indicates a new "document" within this file. In the above, rank_method is one of cos or proximity. This should determine how the search results you output are ranked and sorted. cos should mean using cosine similarity with TF-IDF column values. For proximity positions are term positions, not character positions as used in the first HW. num_results indicates the top how many of the results satisfying the query should be returned. The query is a quoted string of one of more querty terms. For this homework, two terms are the same if they are the same after stripping punctuation and putting the terms to lower case. An example of running the program might be:

python SearchIndex.py my_corpus.txt cos 5 "The quick brown fox"

The output should be a line with DocId Score on it, followed by a sequence of num_result lines with this information for the top num_results many documents. For example,

DocId Score
7 .65
2 .51
3 .23
11 .0012

You should use the ADT from the book for inverted index retrieval and implement your program using calls to the first, last, prev, next methods. The latter two should be implemented using galloping search.

Once you have coded your program, I want you to come up with 10 paragraphs from Wikipedia pages of your choice. These will be your corpus. Make two, at-least-two-term queries, and for each query, determine which documents are relevant. Using your program determine the top 5 documents returned for each of the two scoring functions. Then by hand compute the precision@5 and recall@5 scores. Write up your results and say which scoring system seemed the best.

Point Breakdown

Code is reasonably well-commented and structured 1pt
Code compiles and outputs as described on test corpus 1pt
Can find in code where Inverted Index ADT implemented (1pt) and next and prev implemented using galloping search.(1pt) 2pt
Each of different scoring metrics computed correctly (1.5pt each) 3pts
Code correctly retrieves docs with all the query terms where two terms are equal as described above and correct number of results returned. 1pt
Experiments and write-up in a file Experiments.pdf 2pts
Total10pts